Final Report on Brooklyn Housing Dataset (2003-2017)

Class: Data Science 1 with R (STAT 301-1)

Author

Brian Dinh

Published

November 28, 2023

1 Introduction

For this report, I explored a dataset labelled “Brooklyn Home Sales 2003 to 2017,” which describes all buildings, residential and nonresidential, sold in the New York borough of Brooklyn. The data comes from the government of the state of New York, and links to the data can be found under References (Section 5). Additionally, the cleaned dataset was too large to upload to GitHub (see Section 2), so I have added a Google Drive link to it here.

For this exploratory data analysis, I was primarily motivated by my curiosity about what drives building sales in the Brooklyn market, especially since buildings in this area tend to be more expensive than in the rest of the United States due to the area's urbanity and access to New York City as a whole. Additionally, I wished to look at what factors could have affected housing price increases in the area. Last of all, I wanted to look at which areas in Brooklyn are the most desired, based on the number of sales in this dataset, and which neighborhoods have become less popular.

2 Data Overview and Quality

For the original “Brooklyn Home Sales 2003 to 2017” dataset, there are 111 variables and 390,883 observations. There are 32 categorical variables, 71 numerical variables, 7 logical variables, and 1 date variable.

Many columns have missing values, particularly those associated with geographic mapping information, borough data (e.g. what borough the building is in), and some building information (like total units, building stories, etc.). Due to this missingness, I may be limited in my analysis of nonresidential buildings, as detailed building information could affect nonresidential building prices significantly. I do not believe the missing borough data will affect my analysis because all of the buildings sold are located in Brooklyn.

The dataset is not in the GitHub repository for this final project because it is too large (207.6 MB) to commit, so please refer to the hyperlink in the Introduction (Section 1) for access to the original dataset and the cleaned dataset.

The cleaned dataset has 390,833 observations and 81 columns. There are 19 categorical variables, 61 numeric variables, and 1 date variable.

3 Explorations

3.1 How many buildings are being sold in Brooklyn over time?

Before exploring this question, I wanted to check the different types of buildings being sold in Brooklyn, because I hypothesized that building type strongly affects other attributes of a building, such as its price and size. The variable tax_class_at_sale sorts each building sold into one of 4 categories defined by the state of New York:

  • (Class 1): Includes most residential property of up to three units (such as one-, two-, and three-family homes and small stores or offices with one or two attached apartments), vacant land that is zoned for residential use, and most condominiums that are not more than three stories.
  • (Class 2): Includes all other property that is primarily residential, such as cooperatives and condominiums.
  • (Class 3): Includes property with equipment owned by a gas, telephone or electric company.
  • (Class 4): Includes all other properties not included in classes 1, 2, and 3, such as offices, factories, warehouses, garage buildings, etc.

Figure 1: Distributions of Buildings Sold in Brooklyn by Tax Classes

(a) Counts of Tax Classes of Buildings Sold in Brooklyn, 2003-2017

(b) Distribution of Brooklyn Building Sale Prices by Tax Class and Time, 2003-2017

(c) Distribution of Brooklyn Building Square Feet by Tax Class and Time, 2003-2017

According to Figure 1 (a), the most actively sold buildings are the primarily residential buildings in classes 1 and 2, with class 4 buildings coming third and class 3 buildings rarely being sold. Additionally, class 4 buildings appear to cause large skews in both price and square feet: according to Figure 1 (b) and Figure 1 (c), the majority of the larger and more expensive buildings were class 4.

Thus, I will separate my dataset into residential (class 1 and class 2) buildings and non-residential (class 3 and class 4) buildings, with the majority of my analysis focusing on the residential dataset.
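The split described above can be sketched in a few lines. This is a minimal illustration in plain Python on made-up rows, not the code used for this report: only the column name tax_class_at_sale comes from the dataset, while sale_price and the helper name split_by_tax_class are hypothetical.

```python
# Toy rows standing in for the dataset. Only `tax_class_at_sale` is a
# column named in the report; `sale_price` is an assumed column name.
sales = [
    {"tax_class_at_sale": 1, "sale_price": 450_000},
    {"tax_class_at_sale": 2, "sale_price": 620_000},
    {"tax_class_at_sale": 3, "sale_price": 900_000},
    {"tax_class_at_sale": 4, "sale_price": 3_200_000},
]

# Classes 1 and 2 are treated as residential, as in the report.
RESIDENTIAL_CLASSES = {1, 2}


def split_by_tax_class(rows):
    """Return (residential, non_residential) lists of rows."""
    residential = [
        r for r in rows if r["tax_class_at_sale"] in RESIDENTIAL_CLASSES
    ]
    non_residential = [
        r for r in rows if r["tax_class_at_sale"] not in RESIDENTIAL_CLASSES
    ]
    return residential, non_residential


residential, non_residential = split_by_tax_class(sales)
print(len(residential), len(non_residential))
```

Note that in this sketch any row whose tax class is not 1 or 2, including a missing value, lands in the non-residential group, so missing tax classes would need to be handled before splitting.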

After separating the dataset, I wanted to look at how the number of building sales has changed year by year, to see if there are any periods of downturn or of relatively high sales counts.

Figure 2: Distribution of Brooklyn Building Sales by Year

(a) Number of Residential Building Sales in Brooklyn by Year

(b) Number of Non-Residential Building Sales in Brooklyn by Year

In both Figure 2 (a) and Figure 2 (b), sales in Brooklyn tended to be the highest from 2003 to 2006, while there was a significant dip in sales from 2007 to 2010, which was stronger for residential buildings than for non-residential buildings. From 2011 to 2017, there appears to have been a recovery in building sales. These three time periods are worth comparing to determine whether they followed the same trends. For example, one immediate question I had was whether the decrease in sales came with an increase or a decrease in prices in the building market. To analyze this, I made the following line plots, which show the average sale prices for buildings sold in Brooklyn, grouped by year.
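As a rough sketch of the counting behind this comparison, the snippet below tallies sales per year and per time period using only the Python standard library. The list of sale years is made up, and the helper name period_for is hypothetical; the period boundaries are the ones used in this report.

```python
from collections import Counter


def period_for(year):
    """Map a sale year to one of the report's three time periods."""
    if 2003 <= year <= 2006:
        return "2003-2006"
    if 2007 <= year <= 2010:
        return "2007-2010"
    if 2011 <= year <= 2017:
        return "2011-2017"
    raise ValueError(f"year {year} is outside 2003-2017")


# Made-up sale years, one entry per sale.
sale_years = [2003, 2004, 2004, 2008, 2012, 2016, 2016]

sales_per_year = Counter(sale_years)
sales_per_period = Counter(period_for(y) for y in sale_years)

print(sorted(sales_per_year.items()))
print(sorted(sales_per_period.items()))
```

Because the periods differ in length (four, four, and seven years), per-year counts are the fairer basis for comparison than period totals.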

Figure 3: Average Building Sale Prices by Year in Brooklyn

(a) Average Residential Building Sale Price by Year, Brooklyn

(b) Average Non-Residential Building Sale Price by Year, Brooklyn

Looking at Figure 3 (a) and Figure 3 (b), it appears that the average sale price of buildings in Brooklyn actually decreased from 2007 to 2010, which means that my hypothesis that fewer sales would coincide with higher prices was incorrect. To investigate further, let us look at the price ranges of the buildings sold during the three time periods of 2003-2006, 2007-2010, and 2011-2017.
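The yearly averages plotted in Figure 3 amount to a group-by-year mean. A minimal sketch with made-up (year, price) pairs follows; the function name mean_price_by_year is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_price_by_year(rows):
    """Average sale price per year from (year, price) pairs."""
    prices = defaultdict(list)
    for year, price in rows:
        prices[year].append(price)
    return {year: mean(values) for year, values in sorted(prices.items())}


# Made-up sales, not values from the dataset.
toy_sales = [
    (2006, 500_000),
    (2006, 700_000),
    (2008, 400_000),
    (2008, 450_000),
]

average_by_year = mean_price_by_year(toy_sales)
print(average_by_year)
```

A mean is sensitive to a few very expensive sales, so a median per year could be a useful companion when the price distribution is skewed.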

Figure 4: Distribution of Count of Residential Buildings Sold in Brooklyn by Price

(a) 2003-2006

(b) 2007-2010

(c) 2011-2017

In Figure 4 (a), Figure 4 (b), and Figure 4 (c), the most common selling price in Brooklyn falls in the $300k to $500k range, and this peak is most prominent in Figure 4 (a). In Figure 4 (b), however, the number of homes sold slightly above the $300k to $500k range decreased significantly, which could have caused the decrease in average price for this time period. In the 2011 to 2017 period, there is a more subtle increase in properties sold in the $700k to $1m range, even though buildings in the $300k to $500k range remain the most common, which could explain the large increases in average residential building prices over time.

3.2 What affects the price for buildings?

For the next part of my EDA, I wanted to look at what affects the price of buildings in Brooklyn. First off, I wanted to analyze the relationship between square feet and price using a scatter plot.

Figure 5: Brooklyn Residential Building Sale Price compared to Gross Square Ft

In Figure 5, the linear fit line shows a general positive relationship between price and square feet, meaning that the more square feet a property has, the more expensive it tends to be, which was expected. However, I also wanted to see if residential buildings, which as seen in Figure 3 (a) are getting more expensive over time on average, are also getting larger.
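For reference, a linear fit line of the kind drawn in Figure 5 can be computed with ordinary least squares. The sketch below uses made-up square-foot and price values chosen to lie on a straight line, and the function name fit_line is hypothetical; it is not the fitting code behind the figure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values: each extra square foot adds $200 to the price.
gross_sqft = [1000, 1500, 2000, 2500]
sale_price = [300_000, 400_000, 500_000, 600_000]

slope, intercept = fit_line(gross_sqft, sale_price)
print(slope, intercept)
```

A positive slope corresponds to the positive relationship described above; on real data the points scatter around the line, so the slope describes a tendency and not a rule for individual buildings.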

Figure 6: Distribution of Square Feet in Brooklyn Residential Building Sales

(a) 2003-2006

(b) 2007-2010

(c) 2011-2017

There does not appear to be much difference in average square feet across the three time periods according to Figure 6 (a), Figure 6 (b), and Figure 6 (c), which suggests that some other variable is driving the general increase in average prices of home sales in Brooklyn.

One variable I thought would be a major factor in determining building price was the age of the building, since older buildings could be of worse quality and therefore less desired. First off, let’s look at the distribution of buildings sold from 2003 to 2017 by age.

Figure 7: Distribution of Residential Buildings by Age Brackets

Based on Figure 7, most of the residential buildings sold in Brooklyn were older than 75 years, although a large number of the buildings sold were 0 to 15 years old. Next, I wanted to see whether age strongly affected the price.
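One way to build age brackets like these is to subtract the build year from the sale year and round down to a multiple of the bracket width. The sketch below assumes 15-year brackets, inferred from the labels used in this report (0 to 15, 75 to 90, 90 to 105); the function name age_bracket and the idea of a year-built column are assumptions, not details confirmed by the dataset.

```python
def age_bracket(sale_year, year_built, width=15):
    """Label the age bracket a building falls into at the time of sale.

    Brackets are half-open: an age of exactly 15 falls in "15 to 30".
    """
    age = sale_year - year_built
    if age < 0:
        raise ValueError("building sold before it was built")
    lower = (age // width) * width
    return f"{lower} to {lower + width}"


print(age_bracket(2010, 2005))
print(age_bracket(2015, 1930))
print(age_bracket(2017, 1927))
```

Rows with a missing or placeholder build year would need to be filtered out first, since they would otherwise produce implausibly large ages.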

Figure 8: Average Residential Building Price in Brooklyn by Building Age

Analyzing Figure 8, it appears that the newer buildings (0 to 15 years old) are around the same price as buildings in the highly popular 75 to 90 and 90 to 105 ranges, yet are not sold as often. There is also an interesting dip in average building prices for buildings 25 to 75 years old, and a steep, roughly exponential rise in average price after 105 years, which could be due to the limited number of sales of buildings that old.

Another variable I wanted to look at was prox_code, which identifies the proximity of the property to another property. prox_code is split into three categories: detached, semi-attached, and attached. My initial assumption is that detached homes are the most popular due to having more privacy.

Figure 9: Distribution of Residential Buildings by Proximity Code

(a) Yearly Average Residential Building Prices in Brooklyn, Divided by Proximity Codes

(b) Count of Residential Buildings Sold in Brooklyn from 2003-2017, Divided by Proximity Codes

Figure 9 (a) reveals that, on average, detached and attached buildings were at around the same price points from 2003 to 2017, while semi-attached buildings tended to be cheaper than other properties. However, in Figure 9 (b), semi-attached buildings are the second most sold, with attached buildings being the most sold and detached buildings the least sold. This may indicate that the lower prices of semi-attached properties lead to more sales or transfers of those properties.
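The comparison in Figure 9 combines two summaries per proximity category: how many buildings were sold and at what average price. A minimal sketch on made-up rows is below. The column name prox_code comes from the dataset, but the string labels, the sale_price column name, and the function name summarize_by_proximity are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_proximity(rows):
    """Return {prox_code: {"count": ..., "mean_price": ...}}."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["prox_code"]].append(row["sale_price"])
    return {
        code: {"count": len(prices), "mean_price": mean(prices)}
        for code, prices in grouped.items()
    }


# Made-up rows; the real dataset may encode prox_code differently.
toy_rows = [
    {"prox_code": "attached", "sale_price": 500_000},
    {"prox_code": "attached", "sale_price": 700_000},
    {"prox_code": "semi-attached", "sale_price": 400_000},
    {"prox_code": "detached", "sale_price": 650_000},
]

summary = summarize_by_proximity(toy_rows)
print(summary)
```

Keeping the count next to the mean makes it easier to spot categories whose average rests on only a few sales.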

4 Conclusions

After going through this EDA, I learned from the first section that the majority of buildings sold in Brooklyn are primarily residential, with the number of sales decreasing in 2007 to 2010 relative to 2003 to 2006 and slightly recovering in 2011 to 2017. After the 2007 to 2010 period, the average sale prices for both residential and non-residential buildings increased significantly, with average prices for residential buildings in 2017 hovering around $700,000, in comparison to around $450,000 in 2007. I was surprised by the decreases in both sales and average prices of buildings from 2007 to 2010, since I expected that buildings would be sold less often because they were too expensive.

In the second section of the EDA, I learned that although gross square feet is positively correlated with sale price, there was not a significant increase in the square feet of residential buildings sold in 2011 to 2017 in comparison to 2003 to 2006, even with the significant increases in price. I found this surprising because I expected that if one paid for a more expensive house, the house would be larger. Additionally, semi-attached residential buildings appear to be the cheapest on average, in comparison to attached and detached buildings. Last of all, I learned that the most expensive and most sold residential buildings were typically young (0 to 15 years old) or quite old (75 to 100 years old), while middle-aged buildings (40 to 60 years old) were relatively the cheapest. I was surprised that middle-aged buildings were cheaper than much older buildings, since I thought that price would fall consistently as building age increased.

In the third and final section of my EDA, I learned that the neighborhoods with the most residential properties sold had relatively lower average sale prices and higher square feet than neighborhoods with more expensive properties and fewer sales. Additionally, the neighborhoods with the most residential properties sold tended to remain the same across all three time periods of 2003 to 2006, 2007 to 2010, and 2011 to 2017. Specifically, Bedford Stuyvesant, East New York, Borough Park, and Bay Ridge were in the top five neighborhoods quite often, which suggests that their price point and size could be quite attractive to buyers. I expected the lower priced neighborhoods to have the most sales, as there would be less of a barrier to buying there, although I was surprised by the lower average square feet of buildings in the expensive neighborhoods, since I thought a more expensive neighborhood would have more space.

Some additional points for further exploration would be adjusting sale prices for inflation to see whether the values are actually starkly different, and analyzing more economic data that could be compared with trends in housing, like unemployment or GDP. Additionally, having more up-to-date and complete geographic data for the addresses in this dataset could allow for a visual mapping of the movement of sales over time with a Shiny app and slider. Lastly, adding data from 2018 to 2023 (as this dataset is constantly being added to) would make for an interesting look into how building sales were affected by COVID-19, much like how building sales here were affected by the recession from 2007 to 2010.
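The inflation adjustment suggested above usually means rescaling each nominal price by the ratio of a price index in a chosen base year to the index in the sale year. The sketch below shows the arithmetic only: the index values are placeholders, not real CPI figures, and the function name to_base_year_dollars is hypothetical. The $450,000 figure is the approximate 2007 residential average quoted in the conclusions.

```python
def to_base_year_dollars(nominal_price, sale_year, base_year, price_index):
    """Express a nominal sale price in base-year dollars.

    `price_index` maps year -> index level (for example a consumer
    price index); the caller must supply real values.
    """
    return nominal_price * price_index[base_year] / price_index[sale_year]


# Placeholder index levels for illustration only, NOT real CPI data.
placeholder_index = {2007: 100.0, 2017: 120.0}

adjusted_2007_average = to_base_year_dollars(
    450_000, sale_year=2007, base_year=2017, price_index=placeholder_index
)
print(adjusted_2007_average)
```

With a real index in place of the placeholder, the adjusted 2007 average could be compared directly against the roughly $700,000 average for 2017 to see how much of the increase remains after inflation.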

5 References